This is the first in a series of notebooks designed to show you how to analyze social media data. We assume you have already downloaded the data and are now ready to begin examining it. In this first notebook I will show you how to set up your ipython working environment and import the Twitter data we have downloaded.

Chapter 1: Set up Jupyter, Import Twitter Data and Select Cases¶

First, we will import several necessary Python packages. We will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations. It is invaluable for analyzing datasets.

Import packages¶

In [1]:

import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [2]:

#Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_columns', None)

We can check which version of various packages we're using. You can see I'm running PANDAS 0.13 here.

In [86]:

print pd.__version__

0.13.1

Read in data¶

PANDAS can read in data from a variety of different data types. If you've followed some of my earlier tutorials you have downloaded tweets into an SQLite database, then converted to a CSV file. That's what we have here. We have a set of tweets by Fortune 200 firms. So, in the following three lines we'll first import the CSV file and assign it to the name 'df' -- short for 'dataframe', the PANDAS name for a dataset. Second, we'll use the len function to see how many rows (tweets) there are in the dataset; there are 34,097 tweets in total. Finally, we will use the head function to show the first two rows of the dataset.

In [67]:

df = pd.read_csv('CSR_user_timeline_2013.csv', sep=',', low_memory=False)
print len(df)
df.head(2)

Out[67]:

	rowid	query	tweet_id	tweet_id_str	inserted_date	truncated	language	possibly_sensitive	coordinates	retweeted_status	withheld_in_countries	withheld_scope	created_at_text	created_at	month	year	content	from_user_screen_name	from_user_id	from_user_followers_count	from_user_friends_count	from_user_listed_count	from_user_favourites_count	from_user_statuses_count	from_user_description	from_user_location	from_user_created_at	retweet_count	favorite_count	entities_urls	entities_urls_count	entities_hashtags	entities_hashtags_count	entities_mentions	entities_mentions_count	in_reply_to_screen_name	in_reply_to_status_id	source	entities_expanded_urls	entities_media_count	media_expanded_url	media_url	media_type	video_link	photo_link	twitpic	num_characters	num_words	retweeted_user	retweeted_user_description	retweeted_user_screen_name	retweeted_user_followers_count	retweeted_user_listed_count	retweeted_user_statuses_count	retweeted_user_location	retweeted_tweet_created_at	Fortune_2012_rank	Company	CSR_sustainability	specific_project_initiative_area
0	67340	humanavitality	306897327585652736	306897327585652736	2014-03-09 13:46:50.222857	0	en	NaN	NaN	NaN	NaN	NaN	Wed Feb 27 22:43:19 +0000 2013	2013-02-27 22:43:19.000000	2	2013	@louloushive (Tweet 2) We encourage other empl...	humanavitality	274041023	2859	440	38	25	1766	This is the official Twitter account for Human...	NaN	Tue Mar 29 16:23:02 +0000 2011	0	0	NaN	0	NaN	0	louloushive	1	louloushive	3.062183e+17	web	NaN	NaN	NaN	NaN	NaN	0	0	0	121	19	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	79	Humana	0	1
1	39454	FundacionPfizer	308616393706844160	308616393706844160	2014-03-09 13:38:20.679967	0	es	NaN	NaN	NaN	NaN	NaN	Mon Mar 04 16:34:17 +0000 2013	2013-03-04 16:34:17.000000	3	2013	¿Sabes por qué la #vacuna contra la #neumonía ...	FundacionPfizer	188384056	2464	597	50	11	2400	Noticias sobre Responsabilidad Social y Fundac...	México	Wed Sep 08 16:14:11 +0000 2010	1	0	NaN	0	vacuna, neumonía	2	NaN	0	NaN	NaN	web	NaN	NaN	NaN	NaN	NaN	0	0	0	138	20	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40	Pfizer	0	1

2 rows × 60 columns

List all the columns in the DataFrame

In [68]:

df.columns

Out[68]:

Index([u'rowid', u'query', u'tweet_id', u'tweet_id_str', u'inserted_date', u'truncated', u'language', u'possibly_sensitive', u'coordinates', u'retweeted_status', u'withheld_in_countries', u'withheld_scope', u'created_at_text', u'created_at', u'month', u'year', u'content', u'from_user_screen_name', u'from_user_id', u'from_user_followers_count', u'from_user_friends_count', u'from_user_listed_count', u'from_user_favourites_count', u'from_user_statuses_count', u'from_user_description', u'from_user_location', u'from_user_created_at', u'retweet_count', u'favorite_count', u'entities_urls', u'entities_urls_count', u'entities_hashtags', u'entities_hashtags_count', u'entities_mentions', u'entities_mentions_count', u'in_reply_to_screen_name', u'in_reply_to_status_id', u'source', u'entities_expanded_urls', u'entities_media_count', u'media_expanded_url', u'media_url', u'media_type', u'video_link', u'photo_link', u'twitpic', u'num_characters', u'num_words', u'retweeted_user', u'retweeted_user_description', u'retweeted_user_screen_name', u'retweeted_user_followers_count', u'retweeted_user_listed_count', u'retweeted_user_statuses_count', u'retweeted_user_location', u'retweeted_tweet_created_at', u'Fortune_2012_rank', u'Company', u'CSR_sustainability', u'specific_project_initiative_area'], dtype='object')

We can use the len function again here to see how many columns there are in the dataframe: 60.

In [69]:

len(df.columns)

Out[69]:

We should also inspect the format for our columns. We can see that some are integers, some are 'float' (can have a decimal), and some are 'objects' (text). If you have a identifying text variable that has accidentally been imported as a float, for instance, that could cause problems down the road, so you should fix it before continuing.

In [85]:

df.dtypes

Out[85]:

rowid                                 int64
query                                object
tweet_id_str                          int64
inserted_date                        object
language                             object
coordinates                          object
retweeted_status                     object
created_at                           object
month                                 int64
year                                  int64
content                              object
from_user_screen_name                object
from_user_id                          int64
from_user_followers_count             int64
from_user_friends_count               int64
from_user_listed_count                int64
from_user_favourites_count            int64
from_user_statuses_count              int64
from_user_description                object
from_user_location                   object
from_user_created_at                 object
retweet_count                         int64
favorite_count                        int64
entities_urls                        object
entities_urls_count                   int64
entities_hashtags                    object
entities_hashtags_count               int64
entities_mentions                    object
entities_mentions_count               int64
in_reply_to_screen_name              object
in_reply_to_status_id               float64
source                               object
entities_expanded_urls               object
entities_media_count                float64
media_expanded_url                   object
media_url                            object
media_type                           object
video_link                            int64
photo_link                            int64
twitpic                               int64
num_characters                        int64
num_words                             int64
retweeted_user                      float64
retweeted_user_description           object
retweeted_user_screen_name           object
retweeted_user_followers_count      float64
retweeted_user_listed_count         float64
retweeted_user_statuses_count       float64
retweeted_user_location              object
retweeted_tweet_created_at           object
Fortune_2012_rank                     int64
Company                              object
CSR_sustainability                    int64
specific_project_initiative_area      int64
Length: 54, dtype: object

Remove Unneeded Columns¶

Every researcher will have different preferences about keeping or deleting unneeded columns. You might want to delete variables to make your dataset cleaner, you might want to save memory, or you might want a smaller dataset for some specific analyses. In any case, we can use the drop command to delete individual columns. Let's drop six that are not needed here. The first two have duplicate columns in another format, while the latter four all have zero variation (all are blank).

In [70]:

df = df.drop('created_at_text',1)
df = df.drop('tweet_id',1)
df = df.drop('withheld_in_countries',1)
df = df.drop('withheld_scope',1)
df = df.drop('truncated',1)
df = df.drop('possibly_sensitive',1)

There are now 54 columns in the dataframe.

In [71]:

len(df.columns)

Out[71]:

In [72]:

df.head(2)

Out[72]:

	rowid	query	tweet_id_str	inserted_date	language	coordinates	retweeted_status	created_at	month	year	content	from_user_screen_name	from_user_id	from_user_followers_count	from_user_friends_count	from_user_listed_count	from_user_favourites_count	from_user_statuses_count	from_user_description	from_user_location	from_user_created_at	retweet_count	favorite_count	entities_urls	entities_urls_count	entities_hashtags	entities_hashtags_count	entities_mentions	entities_mentions_count	in_reply_to_screen_name	in_reply_to_status_id	source	entities_expanded_urls	entities_media_count	media_expanded_url	media_url	media_type	video_link	photo_link	twitpic	num_characters	num_words	retweeted_user	retweeted_user_description	retweeted_user_screen_name	retweeted_user_followers_count	retweeted_user_listed_count	retweeted_user_statuses_count	retweeted_user_location	retweeted_tweet_created_at	Fortune_2012_rank	Company	CSR_sustainability	specific_project_initiative_area
0	67340	humanavitality	306897327585652736	2014-03-09 13:46:50.222857	en	NaN	NaN	2013-02-27 22:43:19.000000	2	2013	@louloushive (Tweet 2) We encourage other empl...	humanavitality	274041023	2859	440	38	25	1766	This is the official Twitter account for Human...	NaN	Tue Mar 29 16:23:02 +0000 2011	0	0	NaN	0	NaN	0	louloushive	1	louloushive	3.062183e+17	web	NaN	NaN	NaN	NaN	NaN	0	0	0	121	19	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	79	Humana	0	1
1	39454	FundacionPfizer	308616393706844160	2014-03-09 13:38:20.679967	es	NaN	NaN	2013-03-04 16:34:17.000000	3	2013	¿Sabes por qué la #vacuna contra la #neumonía ...	FundacionPfizer	188384056	2464	597	50	11	2400	Noticias sobre Responsabilidad Social y Fundac...	México	Wed Sep 08 16:14:11 +0000 2010	1	0	NaN	0	vacuna, neumonía	2	NaN	0	NaN	NaN	web	NaN	NaN	NaN	NaN	NaN	0	0	0	138	20	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40	Pfizer	0	1

2 rows × 54 columns

If you have only a few columns to delete you can use the drop command as shown above. On the other hand, if you only want to keep a few columns, you can create a new version of the dataframe with only those columns you like. Note that the double square brackets -- "[[...]]" -- in PANDAS forms a dataframe representation. In the following example, I am creating a new dataframe with only three variables. You can see that this new dataframe has the same number of tweets but fewer columns (variables).

In [73]:

df2 = df[['created_at', 'from_user_screen_name', 'retweet_count']]
print len(df2)
df2.head(2)

Out[73]:

	created_at	from_user_screen_name	retweet_count
0	2013-02-27 22:43:19.000000	humanavitality	0
1	2013-03-04 16:34:17.000000	FundacionPfizer	1

2 rows × 3 columns

View Twitter Accounts Represented in DF¶

We can use the unique function to find how many unique Twitter accounts are represented in the dataset. First, I'll show you what unique function does -- it creates an array of all the screen_names of the Twitter accounts.

In [74]:

pd.unique(df.from_user_screen_name.ravel())

Out[74]:

array(['humanavitality', 'FundacionPfizer', 'Intelinvolved', 'CiscoCSR',
       'IBMSmartCities', 'citizenIBM', 'BofA_Community', 'PG_CSDW',
       'mathmovesu', 'PPGIdeascapes', 'gehealthy', 'msftcitizenship',
       'TICalculators', 'CiscoEDU', 'DuPont_ability', 'Dell4Good',
       'verizongiving', 'DellEDU', 'SprintGreenNews', 'TeachingMoney',
       'WalmartGreen', 'ecomagination', 'WalmartAction', 'Microsoft_Green',
       'Comcastdreambig', 'DE_Youtility', 'hpglobalcitizen',
       'CitizenDisney', 'AmgenFoundation', 'ATTAspire', 'AlcoaFoundation',
       'nikebetterworld', 'HoneywellBuild', 'googlestudents',
       '3M_FoodSafety', 'HeartRescue', 'EnviroSears', 'ClickToEmpower',
       'GreenIBM', 'PromesaPepsiCo', 'FordDriveGreen', 'JNJStories'], dtype=object)

Note again how we can use len to find out how many accounts are in the array: 42

In [75]:

len(pd.unique(df.from_user_screen_name.ravel()))

Out[75]:

Remove Tweets from One Specific Account¶

We want to get rid of all tweets by TICalculators from the dataframe. Unlike the other 41 Twitter accounts in the dataset, this account is not a CSR-related account. First, we can use the len function combined with a dataframe query to count the number of tweets that are not sent by TICalculators: 32,300

In [76]:

len(df[df['from_user_screen_name'] != 'TICalculators'])

Out[76]:

We should then also check how many tweets are sent by TICalculators: 1,767

In [77]:

len(df[df['from_user_screen_name'] == 'TICalculators'])

Out[77]:

We can use Python to do "math." Let's use this to show whether the two numbers returned in the above steps add up to the total number of tweets in our dataframe. They do. While this may seem like an unnecessary step, it is always critical to perform such basic data checks in order to avert unexpected data disasters.

In [78]:

1767 + 32330

Out[78]:

We can also do this another way

In [79]:

(1767 + 32330) - len(df)

Out[79]:

Or even

In [80]:

len(df[df['from_user_screen_name'] != 'TICalculators']) + len(df[df['from_user_screen_name'] == 'TICalculators']) - len(df)

Out[80]:

Remove Tweets by TICalculators¶

In the next block of code we will create a new version of our dataframe, this time limiting it to only those tweets that are not sent by TICalculators. As we can see, there are now 32,300 tweets in this dataframe, the same number as we calculated above.

In [81]:

df = df[df['from_user_screen_name'] != 'TICalculators']
print len(df)
df.head(2)

Out[81]:

	rowid	query	tweet_id_str	inserted_date	language	coordinates	retweeted_status	created_at	month	year	content	from_user_screen_name	from_user_id	from_user_followers_count	from_user_friends_count	from_user_listed_count	from_user_favourites_count	from_user_statuses_count	from_user_description	from_user_location	from_user_created_at	retweet_count	favorite_count	entities_urls	entities_urls_count	entities_hashtags	entities_hashtags_count	entities_mentions	entities_mentions_count	in_reply_to_screen_name	in_reply_to_status_id	source	entities_expanded_urls	entities_media_count	media_expanded_url	media_url	media_type	video_link	photo_link	twitpic	num_characters	num_words	retweeted_user	retweeted_user_description	retweeted_user_screen_name	retweeted_user_followers_count	retweeted_user_listed_count	retweeted_user_statuses_count	retweeted_user_location	retweeted_tweet_created_at	Fortune_2012_rank	Company	CSR_sustainability	specific_project_initiative_area
0	67340	humanavitality	306897327585652736	2014-03-09 13:46:50.222857	en	NaN	NaN	2013-02-27 22:43:19.000000	2	2013	@louloushive (Tweet 2) We encourage other empl...	humanavitality	274041023	2859	440	38	25	1766	This is the official Twitter account for Human...	NaN	Tue Mar 29 16:23:02 +0000 2011	0	0	NaN	0	NaN	0	louloushive	1	louloushive	3.062183e+17	web	NaN	NaN	NaN	NaN	NaN	0	0	0	121	19	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	79	Humana	0	1
1	39454	FundacionPfizer	308616393706844160	2014-03-09 13:38:20.679967	es	NaN	NaN	2013-03-04 16:34:17.000000	3	2013	¿Sabes por qué la #vacuna contra la #neumonía ...	FundacionPfizer	188384056	2464	597	50	11	2400	Noticias sobre Responsabilidad Social y Fundac...	México	Wed Sep 08 16:14:11 +0000 2010	1	0	NaN	0	vacuna, neumonía	2	NaN	0	NaN	NaN	web	NaN	NaN	NaN	NaN	NaN	0	0	0	138	20	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40	Pfizer	0	1

2 rows × 54 columns

Now let's check again for all the unique accounts in the dataframe -- as you can see, TICalculators is gone and there are now 41 accounts.

In [82]:

pd.unique(df.from_user_screen_name.ravel())

Out[82]:

array(['humanavitality', 'FundacionPfizer', 'Intelinvolved', 'CiscoCSR',
       'IBMSmartCities', 'citizenIBM', 'BofA_Community', 'PG_CSDW',
       'mathmovesu', 'PPGIdeascapes', 'gehealthy', 'msftcitizenship',
       'CiscoEDU', 'DuPont_ability', 'Dell4Good', 'verizongiving',
       'DellEDU', 'SprintGreenNews', 'TeachingMoney', 'WalmartGreen',
       'ecomagination', 'WalmartAction', 'Microsoft_Green',
       'Comcastdreambig', 'DE_Youtility', 'hpglobalcitizen',
       'CitizenDisney', 'AmgenFoundation', 'ATTAspire', 'AlcoaFoundation',
       'nikebetterworld', 'HoneywellBuild', 'googlestudents',
       '3M_FoodSafety', 'HeartRescue', 'EnviroSears', 'ClickToEmpower',
       'GreenIBM', 'PromesaPepsiCo', 'FordDriveGreen', 'JNJStories'], dtype=object)

In [83]:

len(pd.unique(df.from_user_screen_name.ravel()))

Out[83]:

We now have our new dataframe without TICalculators. We are left with 32,330 tweets sent by 41 companies over the course of 2013.

Save New DataFrame¶

We will now save the dataframe in PANDAS' native format. It's called 'pickling' a file, so we'll give it the typical 'pkl' extension.

In [87]:

df.to_pickle('CSR tweets - 2013 by 41 accounts.pkl')

For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter @gregorysaxton